Enabled running Pallas Flash Attention on CPU. #922
base: main
Conversation
@ruomingp Could you take a look?
A few thoughts missed in earlier reviews...
```
@@ -152,6 +153,8 @@ def test_decode_against_ref(
        kv_head_factor: int,
        window_len: int,
    ):
        if jax.default_backend() != "gpu" and seq_len > 1024:
```
Nit: can we check it against `"cpu"` directly instead of `!= "gpu"`?
Yes, done.
```
@@ -346,6 +357,9 @@ def test_cudnn_against_triton_ref(
        causal: bool,
        dtype: jnp.dtype,
    ):
        if jax.default_backend() == "cpu":
```
Likewise, let's avoid assuming that the backend is either gpu or cpu in multiple places.
```diff
-if jax.default_backend() == "cpu":
+if jax.default_backend() != "gpu":
```
I'll leave this code as-is, as you asked:

> Nit: can we check it against `"cpu"` directly instead of `!= "gpu"`?

In addition, at the beginning of the file, only "gpu" and "cpu" are allowed, so `== "cpu"` is equivalent to `!= "gpu"` in this code:

```python
if jax.default_backend() not in ("gpu", "cpu"):
```
> In addition, at the beginning of the file, only "gpu" and "cpu" are allowed, so `== "cpu"` is equivalent to `!= "gpu"` in this code.

I know you are making this assumption, but such a dependency is fragile: what if we extend the supported backends in the future?

In this case, requiring the backend to be "gpu" is both more robust and readable. What's the downside?
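To illustrate the point, a hedged sketch (the `require_backend` helper is hypothetical, not something in the repo): a test declares the backend it needs rather than excluding the ones it cannot use, so adding new backends later does not silently change which tests run.

```python
import jax
import pytest


def require_backend(*backends: str):
    """Hypothetical helper: skip unless the current backend is one the test supports."""
    backend = jax.default_backend()
    if backend not in backends:
        pytest.skip(reason=f"Test requires one of {backends}, got {backend}.")


# A cudnn test would then state its requirement directly:
# require_backend("gpu")
```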
```python
if jax.default_backend() == "cpu":
    pytest.skip(reason="cudnn function needs GPU.")
```
And here and elsewhere.
As mentioned above, I'll keep using `jax.default_backend() == "cpu"`.
```diff
-seq_len=[1024, 32768],
+seq_len=[1024],
```
Since the sliding window size is 1024, it will be useful to keep a test case for `seq_len > 1024`. We can enable the test only on TPU if it's too slow on CPU. We can also use a `seq_len` such as 2048 for CPU if it's fast enough.
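One possible shape for this, as a rough sketch only (the parameter values and the skip threshold are illustrative, not the final ones chosen in the PR):

```python
import jax
import pytest


@pytest.mark.parametrize("seq_len", [1024, 2048, 32768])
def test_decode_against_ref(seq_len: int):
    # Keep a seq_len > 1024 case so the sliding window (1024) is actually exercised;
    # only the largest case is restricted to accelerators.
    if jax.default_backend() == "cpu" and seq_len > 2048:
        pytest.skip(reason="Too slow on CPU.")
    ...
```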
```python
softmax_scale=softmax_scale,
block_size=block_size,
interpret=(backend == "cpu"),
```
Given how often we do this across locations, I wonder if we can do the following:

- Make `interpret` default to None (instead of False);
- If it's None, assume `interpret=True` if the backend is "cpu".

WDYT?
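A rough sketch of what that could look like (the function name and signature are illustrative, not the actual axlearn API):

```python
from typing import Optional

import jax


def resolve_interpret(interpret: Optional[bool] = None) -> bool:
    """If the caller does not specify `interpret`, interpret only when running on CPU."""
    if interpret is None:
        return jax.default_backend() == "cpu"
    return interpret


# Call sites could then drop the repeated `interpret=(backend == "cpu")` and pass nothing.
```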
Thank you for your suggestion. `interpret=True` applies only to the Pallas kernel. Therefore, having an `interpret` variable in the flash layer is not aligned with the appropriate level of abstraction: neither the JAX fallback nor the cudnn code path needs this variable.

Additionally, this line was added so contributors can easily debug the Pallas kernel on the CPU. For instance, changing the `if` statement to

```python
elif backend in ("cpu", "tpu"):
```

would allow debugging in `layer_test.py`.
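As a rough illustration of the structure being described (all names below are hypothetical stand-ins, not the actual axlearn code): the `interpret` flag stays inside the Pallas code path, and the cudnn and plain-JAX paths never see it.

```python
import functools


# Hypothetical stand-ins for the real kernels (names are illustrative only).
def cudnn_flash_attention(*args, **kwargs): ...
def pallas_tpu_flash_attention(*args, interpret=False, **kwargs): ...
def reference_attention(*args, **kwargs): ...


def select_flash_attention(backend: str):
    # Only the Pallas path takes `interpret`; the other paths never need it.
    if backend == "gpu":
        return cudnn_flash_attention
    elif backend in ("cpu", "tpu"):
        # Including "cpu" here lets the TPU Pallas kernel run in interpreter mode for debugging.
        return functools.partial(pallas_tpu_flash_attention, interpret=(backend == "cpu"))
    return reference_attention
```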
Pallas supports CPU simulation (`interpret=True`), so we can use the same TPU Pallas kernel on CPU, which makes code debugging easier.

This change lets the following unittests run on CPU as if they were on TPU, enabling easier testing and debugging:

- `axlearn/common/flash_attention/tpu_attention_test.py`

Similarly, `gpu_attention_test.py` can also be run on CPU as if it were on GPU:

- `axlearn/common/flash_attention/gpu_attention_test.py`

Now CI covers those tests on CPU as well.

On an M3 Max MacBook Pro, test coverage and processing time are as follows:

- `axlearn/common/flash_attention/gpu_attention_test.py`: 3024 passed, 1345 skipped in 200.38s (0:03:20)
- `axlearn/common/flash_attention/tpu_attention_test.py`: 18 passed, 435 skipped in 34.82s
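For reference, a minimal standalone example of Pallas interpreter mode (not taken from this PR): with `interpret=True`, `pallas_call` evaluates the kernel with regular JAX operations, so the same kernel source can run and be debugged on CPU.

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl


def add_one_kernel(x_ref, o_ref):
    # Ordinary Pallas kernel body; interpreter mode only changes how it is executed.
    o_ref[...] = x_ref[...] + 1.0


x = jnp.arange(8, dtype=jnp.float32)
y = pl.pallas_call(
    add_one_kernel,
    out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
    interpret=True,  # run the kernel via the Pallas interpreter, e.g. on CPU
)(x)
print(y)  # [1. 2. 3. 4. 5. 6. 7. 8.]
```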
Thank you for the review. I have responded to all comments. Could you check again?